Intelligent RDD Management for High Performance In-Memory Computing in Spark

Authors

  • Mingyue Zhang
  • Renhai Chen
  • Xiaowang Zhang
  • Zhiyong Feng
  • Guozheng Rao
  • Xin Wang
Abstract

Spark is a widely used in-memory computing framework in the era of big data. It can greatly accelerate computation by wrapping the accessed data as resilient distributed datasets (RDDs) and storing these datasets in fast main memory. However, main memory is limited, and Spark does not provide an intelligent mechanism for deciding which RDDs to keep in it. In this paper, we propose a fine-grained RDD checkpointing and kick-out selection strategy, by which Spark can intelligently select the RDDs to retain so as to make the best use of memory. The experiment is conducted on a server with four nodes. Experimental results demonstrate that the proposed techniques can effectively speed up execution.
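
The abstract does not spell out how RDDs are chosen for kick-out, so the following is only a minimal sketch of one possible cost-aware heuristic, not the paper's algorithm. It is plain Python and does not call Spark. The class `CachedDataset`, the function `select_kick_out`, and the fields `recompute_cost` and `future_uses` are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedDataset:
    """Hypothetical stand-in for a cached RDD; not a Spark class."""
    name: str
    size_mb: float          # memory the dataset occupies (assumed > 0)
    recompute_cost: float   # assumed cost (e.g. seconds) to rebuild it from lineage
    future_uses: int        # assumed number of remaining accesses


def select_kick_out(cached, needed_mb):
    """Pick datasets to evict until at least `needed_mb` is freed.

    Heuristic (an assumption, not the paper's policy): evict first the
    datasets whose expected recomputation cost per megabyte is lowest.
    """
    def value_per_mb(d):
        return d.recompute_cost * d.future_uses / d.size_mb

    victims, freed = [], 0.0
    for d in sorted(cached, key=value_per_mb):
        if freed >= needed_mb:
            break
        victims.append(d)
        freed += d.size_mb
    return victims
```

In this sketch, a dataset that will never be read again has zero value and is evicted first, while a small dataset that is expensive to rebuild and frequently reused is kept longest. A real policy would also have to weigh checkpointing to disk against recomputation, which this sketch leaves out.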

Similar resources

Neutrino: Revisiting Memory Caching for Iterative Data Analytics

In-memory analytics frameworks such as Apache Spark are rapidly gaining popularity as they provide order of magnitude performance speedup over disk-based systems for iterative workloads. For example, Spark uses the Resilient Distributed Dataset (RDD) abstraction to cache data in memory and iteratively compute on it in a distributed cluster. In this paper, we make the case that existing abtracti...

GeoSpark: A Cluster Computing Framework for Processing Spatial Data

This paper introduces GeoSpark, an in-memory cluster computing framework for processing large-scale spatial data. GeoSpark consists of three layers: Apache Spark Layer, Spatial RDD Layer and Spatial Query Processing Layer. Apache Spark Layer provides basic Spark functionalities that include loading / storing data to disk as well as regular RDD operations. Spatial RDD Layer consists of three nove...

Efficient In-memory Data Management: An Analysis

This paper analyzes the performance of three systems for in-memory data management: Memcached, Redis and the Resilient Distributed Datasets (RDD) implemented by Spark. By performing a thorough performance analysis of both analytics operations and fine-grained object operations such as set/get, we show that none of the systems handles both types of workloads efficiently. For Memcached and Redis the C...

Novel Apache Spark based Algorithm to Solve Dirichlet Problem for Poisson Equation in 3D Computational Domain

Corresponding Author: Shomanov Aday, Department of Computer Science, al-Farabi Kazakh National University, Almaty, Kazakhstan. Abstract: Parallel computations are an essential tool in solving large-scale computationally demanding problems. Due to the large diversity and heterogeneity of the currently available parallel processing techniques and paradigms it is usually diff...

The Effects of Spark Training on Visual-Spatial Working Memory Operation in Children with Mental Retardation

Introduction: Mentally retarded children, who receive a wide range of health services, represent more than two percent of the population. Mental retardation is associated with significant constraints on mental performance and adaptive behavior as well as perceptual and practical skills. According to the studies, one of the important tools that can affect cognitive abilities, such as memory, is ...

Publication date: 2017